---
id: benchmarks-mmlu
title: MMLU
sidebar_label: MMLU
---

<head>
  <link rel="canonical" href="https://deepeval.com/docs/benchmarks-mmlu" />
</head>

import Equation from "@site/src/components/Equation";

**MMLU (Massive Multitask Language Understanding)** is a benchmark for evaluating LLMs through multiple-choice questions. These questions cover 57 subjects such as math, history, law, and ethics. For more information, [visit the MMLU GitHub page](https://github.com/hendrycks/test).

:::tip
`MMLU` covers a broad variety and depth of subjects, and is good at detecting areas where a model **may lack understanding** in a certain topic.
:::

## Arguments

There are **TWO** optional arguments when using the `MMLU` benchmark:

- [Optional] `tasks`: a list of tasks (`MMLUTask` enums), specifying which of the **57 subject** areas to evaluate in the language model. By default, this is set to all tasks. Detailed descriptions of the `MMLUTask` enum can be found [here](#mmlu-tasks).
- [Optional] `n_shots`: the number of "shots" to use for few-shot learning. This is set to **5 by default** and cannot exceed this number.

## Usage

The code below evaluates a custom `mistral_7b` model ([click here to learn how to use **ANY** custom LLM](/docs/benchmarks-introduction#benchmarking-your-llm)) and assesses its performance on High School Computer Science and Astronomy using 3-shot learning.

```python
from deepeval.benchmarks import MMLU
from deepeval.benchmarks.mmlu.task import MMLUTask

# Define benchmark with specific tasks and shots
benchmark = MMLU(
    tasks=[MMLUTask.HIGH_SCHOOL_COMPUTER_SCIENCE, MMLUTask.ASTRONOMY],
    n_shots=3
)

# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)
```

The `overall_score` for this benchmark ranges from 0 to 1, where 1 signifies perfect performance and 0 indicates no correct answers. The model's score, based on **exact matching**, is calculated by determining the proportion of multiple-choice questions for which the model produces the precise correct letter answer (e.g. 'A') in relation to the total number of questions.

As a result, utilizing more few-shot prompts (`n_shots`) can greatly improve the model's robustness in generating answers in the exact correct format and boost the overall score.

## MMLU Tasks

The MMLUTask enum classifies the diverse range of subject areas covered in the MMLU benchmark.

```python
from deepeval.benchmarks.tasks import MMLUTask

mm_tasks = [MMLUTask.HIGH_SCHOOL_EUROPEAN_HISTORY]
```

Below is the comprehensive list of all available tasks:

- `HIGH_SCHOOL_EUROPEAN_HISTORY`
- `BUSINESS_ETHICS`
- `CLINICAL_KNOWLEDGE`
- `MEDICAL_GENETICS`
- `HIGH_SCHOOL_US_HISTORY`
- `HIGH_SCHOOL_PHYSICS`
- `HIGH_SCHOOL_WORLD_HISTORY`
- `VIROLOGY`
- `HIGH_SCHOOL_MICROECONOMICS`
- `ECONOMETRICS`
- `COLLEGE_COMPUTER_SCIENCE`
- `HIGH_SCHOOL_BIOLOGY`
- `ABSTRACT_ALGEBRA`
- `PROFESSIONAL_ACCOUNTING`
- `PHILOSOPHY`
- `PROFESSIONAL_MEDICINE`
- `NUTRITION`
- `GLOBAL_FACTS`
- `MACHINE_LEARNING`
- `SECURITY_STUDIES`
- `PUBLIC_RELATIONS`
- `PROFESSIONAL_PSYCHOLOGY`
- `PREHISTORY`
- `ANATOMY`
- `HUMAN_SEXUALITY`
- `COLLEGE_MEDICINE`
- `HIGH_SCHOOL_GOVERNMENT_AND_POLITICS`
- `COLLEGE_CHEMISTRY`
- `LOGICAL_FALLACIES`
- `HIGH_SCHOOL_GEOGRAPHY`
- `ELEMENTARY_MATHEMATICS`
- `HUMAN_AGING`
- `COLLEGE_MATHEMATICS`
- `HIGH_SCHOOL_PSYCHOLOGY`
- `FORMAL_LOGIC`
- `HIGH_SCHOOL_STATISTICS`
- `INTERNATIONAL_LAW`
- `HIGH_SCHOOL_MATHEMATICS`
- `HIGH_SCHOOL_COMPUTER_SCIENCE`
- `CONCEPTUAL_PHYSICS`
- `MISCELLANEOUS`
- `HIGH_SCHOOL_CHEMISTRY`
- `MARKETING`
- `PROFESSIONAL_LAW`
- `MANAGEMENT`
- `COLLEGE_PHYSICS`
- `JURISPRUDENCE`
- `WORLD_RELIGIONS`
- `SOCIOLOGY`
- `US_FOREIGN_POLICY`
- `HIGH_SCHOOL_MACROECONOMICS`
- `COMPUTER_SECURITY`
- `MORAL_SCENARIOS`
- `MORAL_DISPUTES`
- `ELECTRICAL_ENGINEERING`
- `ASTRONOMY`
- `COLLEGE_BIOLOGY`
